A Method to Quantify Corpus Similarity and its Application to Quantifying the Degree of Literality in a Document
Abstract
Comparing and quantifying corpora is a key issue in corpus-based translation and corpus linguistics, for which there is still a notable lack of measures. This makes it difficult for a user to isolate, transpose, or extend the interesting features of a corpus to other NLP systems. In this work we address the issue of measuring similarity between corpora. We suggest a scale between two user-chosen corpora on which any third given corpus can be assigned a coefficient of similarity, based on the cross-entropy of statistical N-gram character models. A possible application of this framework is to quantify similarity in terms of literality (or conversely, orality). To this end we carry out experiments on several well-known corpora in both English and Japanese, and show that the defined similarity coefficient is robust to variations in language and model order. Comparing it to other existing similarity measures shows similar performance, while widely extending the range of application to electronic data written in languages with no clear word segmentation. Within this framework we further investigate the notion of homogeneity in the case of a large multilingual resource.
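The idea of placing a third corpus on a scale between two reference corpora can be illustrated with a minimal sketch. The abstract does not state the exact formula, smoothing method, or model order, so everything below is an assumption made for illustration: the function names, the add-one smoothing, the padding symbol, and the normalization `h_a / (h_a + h_b)` are hypothetical and are not claimed to be the authors' definition.

```python
import math
from collections import Counter

PAD = "\x02"  # hypothetical start-of-text padding symbol (assumption)


def train_char_ngram(text, n):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = PAD * (n - 1) + text
    grams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    # +1 reserves one slot for any character unseen in training.
    vocab_size = len(set(text)) + 1
    return grams, contexts, vocab_size


def cross_entropy(model, text, n):
    """Average bits per character of `text` under an add-one smoothed model.

    Assumes `text` is non-empty.
    """
    grams, contexts, vocab_size = model
    padded = PAD * (n - 1) + text
    bits = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        # Add-one smoothing is an assumption, chosen only for brevity.
        p = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        bits -= math.log2(p)
    return bits / len(text)


def similarity_coefficient(ref_a, ref_b, target, n=3):
    """Place `target` on a 0..1 scale between two reference corpora.

    Values near 0 mean the target is better predicted by the model of
    `ref_a`; values near 1 mean it is better predicted by `ref_b`.
    This normalization is an illustrative assumption.
    """
    h_a = cross_entropy(train_char_ngram(ref_a, n), target, n)
    h_b = cross_entropy(train_char_ngram(ref_b, n), target, n)
    return h_a / (h_a + h_b)
```

Because the models operate on characters rather than words, the same sketch applies unchanged to text without explicit word segmentation, such as Japanese. With a corpus of spoken-style text as `ref_a` and a corpus of written-style text as `ref_b`, the coefficient of a third text gives a rough indication of where it falls between orality and literality.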
Similar resources
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
A new text-mining method for extracting user context information to improve the ranking of search engine results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
An investigation of the role of homograph context types in determining similarity between documents
Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation, thereby improving the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts can be divided into different types, wit...
A Geometric View of Similarity Measures in Data Mining
The main objective of data mining is to acquire information from a set of data for prospective applications using a measure. The concern is that one often has to deal with large-scale data. Several dimensionality reduction techniques, such as various feature extraction methods, have been developed to address this issue. However, the geometric view of the applied measure, as an additional consid...
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
Journal title:
- IJTHI
Volume 2, issue
Pages -
Publication date 2006